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Summary. The ECME algorithm has proven to be an effective way of accelerating the EM 
algorithm for many problems. Recognising the limitation of using prefixed acceleration sub- 
spaces in ECME, we propose a new Dynamic ECME (DECME) algorithm which allows the 
acceleration subspaces to be chosen dynamically. Our investigation of the classical Succes- 
sive Overrelaxation (SOR) method, which can be considered as a special case of DECME, 
leads to an efficient, simple, stable, and widely applicable DECME implementation, called 
DECME_v1. The fast convergence of DECME_v1 is established by the theoretical result that, 
in a small neighbourhood of the maximum likelihood estimate (MLE), DECME_v1 is equivalent 
to a conjugate direction method. Numerical results show that DECME_v1 and its two variants 
often converge faster than EM by a factor of one hundred in terms of number of iterations and 
a factor of thirty in terms of CPU time when EM is very slow. 

Keywords: Conjugate direction; EM algorithm; ECM algorithm; ECME algorithm; Successive 
overrelaxation. 



1. Introduction 



After its booming popularity of 30 years since the publication of lDempster et al. (19771 ) , the 
EM algorithm is still expanding its application scope in various areas. At the same time, to 
overcome the slow convergence of EM, quite a few extensions of EM have been developed in 
such a way that they run faster than EM while maintain ing its widely recognised simplicity 
and stability. We refer to IVaradhan and Roland (20081 ) for a recent nice review of various 
methods for accelerating EM. In the present pa per, we start by exploring the convergence 
of the ECME algorithm (|Liu and Rubin. 19941) . which has prov ed to be a simple and ef- 



fective method to accelerate its parent EM algorithm (see, e.g., 
Kowalski et al.. 1997t iPinheiro et al.. 20011 ) , to name a few. 



Sammel and Ryan. 19961 



ECME is a simple extension of the ECM algorithm (jMeng and Rubin. 19931 ) which itself 
is an extension of EM. These three algorithms are summarised as follows. Let Y b s be the 
observed data. Denote by L(0\Y o i, s ), 9 6 9 C TZ P , the observed log- likelihood function 
of 9. The problem is to find the MLE 9 that maximises L(9\Y i >s ). Let Y = (Yobs, Ymis) 
represent the complete data with Y b s augmented by the missing data Y m i S . As an iterative 
algorithm, the ith iteration of EM consists of the E-step, which computes Q(0\Y o b s , 9t-x)i 
the expected complete-data log-likelihood function given the observed data and the current 
estimate 9 t -\ of 9, and the M-step, which finds 9 = 9 t to maximise Q(9\Y t s ,9 t -i). 
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The ECM algorithm replaces the M-step with a sequence of simpler constrained or 
conditional maximisation (CM) steps, indexed by s = 1, • • ■ , S, each of which fixes some 
function of 9, h s (9). The ECME algorithm further partitions the S CM-steps into two 
groups 5?q and 5?l with U .Y L = {1, ■ • • , S}. While the CM-steps indexed by s G 5?q 
(refereed to as the MQ-steps) remain the same with ECM, the CM-steps indexed by s £ ,5^h 
(refereed to as the ML-steps) maximise L(9\Y b s ) in the subspace induced by h s {9). A 
more general framework that includes ECM and ECME as special cases is developed in 



Meng and van Dvk (19971) . However, most of the practical algorithms developed under this 
umbrella belong to the scope of a simple case, i.e., the parameter constraints are formed 
by creating a partition, of 6 as • • • , 9s) with associated dimensions (d\, • • • , ds)- 
Mathematically we have h s (9) = {B\, • • • , 6 s -i, &s+i, • • • , 0s) for s = 1, • • • , 5. 

The advantage of ECME over EM in terms of efficiency depends on the relationship 
between the slowest converging directions of EM and the acceleration subspaces of ECME, 
i.e., the subspaces for the ML-steps. For example, when the former is effectively embedded 
within the latter, ECME achieves its superior gain of efficiency over its parent EM. In 
practise, we usually have no information about the convergence of EM before obtaining the 
MLE and cannot select the prefixed acceleration subspaces of ECME accordingly. Hence 
small or minor efficiency gain by ECME is expected in some situations. This is illustrated by 
the two examples in Section 2 and motivates the idea of dynamically constructing subspaces 
for applying the ML-step. This idea is formulated as the generic DECME algorithm. It 
includes SOR as a special case. SOR was first developed as an accelerato r for a class 



of iterative solvers of linear systems in 1950's (IFrankel. 1950 ; Young. 1954). The same 



idea has been frequently explored in the context of EM ( Salakhutdinov and Roweis. 20031 : 



Hesterberg. 20051 . among many others although sometimes under different names). However, 
as shown later, SOR suffers from what is known as the zigzagging problem. Hence it is often 
inefficient. 

Motivated by the zigzagging phenomenon observed on SOR, we propose an efficient 
DECME implementation, called DECME_vl. It is shown that, under some common assump- 
tions, DECME_vl is equivalent to a conjugate directio n method, which has been proposed in 
several different contexts , e.g., solving linear sys te ms ( Concus et al.. 1976f) and non orthogo- 



nal analysis of variance (IGolub and Nash. 19821) . Ijamshidian and Jennrich (19931 ) propose 



to use the conjugate direction method to accelerate EM. They call the resulting method 
AEM and demonstrate its dramatically improved efficiency. However, AEM is not as pop- 
ular as one would expect it to be. This is perhaps due to its demands for extra efforts for 
coding the gradient vector of L(9\Y b s ), which is problem specific and can be expensive to 
evaluate. 

Compared to AEM, DECME_vl is simpler to implement because it does not require 
computing the gradient of L(8\Y b s ). It does require function evaluations, which are typically 
coded with EM implementation for debugging and monitoring convergence. As SOR, the 
only extra requirement for implementing DECME_vl is a simple line search scheme. Such a 
line search scheme can be used for almost all EM algorithms for different models. To reduce 
the number of function evaluations, two variants of DECME.vl, called DECME_v2 and 
DECME_v3, are also considered. Numerical results show that all the three new DECME 
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implementations obtain dramatic efficiency improvement over EM, ECME, and SOR in 
terms of both number of iterations and CPU time. 

The remaining of the paper is arranged as follows. Section[5]provides a pair of motivating 
ECME examples. Section [3] defines the generic DECME algorithm, discusses the conver- 
gence of SOR, and proposes the three efficient novel implementations of DECME. Section 
2] presents several numerical examples to compare the performance of different methods. 
Section [5] concludes with a few remarks. 



2. Two Motivating ECME Examples 



Following iDempster et al. (19771 ). in a small neighbourhood of 9, we have approximately 

6-e t = DM EM {6-6 t _ 1 ), (1) 

where the pxp matrix DM EM is known as the missing information fraction and determines 
the convergence rate of EM. More specifically, each eigenvalue of DAI EM determines the 
convergence rate of EM along the direction of its corresponding eigenvector (see review in 
Appendix [Bj . 



It is shown in iLiu and Rubin ( 19941 ) that ECME also has a linear convergence rate de- 
termined by the p x p matrix DM ECME that plays the same role for ECME as DM EM does 
for EM. Obviously, ECME will be faster than EM if the largest eigenvalue of DM ECME 
is smaller than that of DM EM . With the following two examples we illustrate that it 
is the choice of the acceleration subspaces by ECME that determines the relative mag- 
nitude of the dominating eigenvalues of DM EM and DM ECME , and hence the relative 
effi ciency of EM and ECME. All the numerical examples in this paper are implemented in 



R l|R Development Core Team. 20081) . 



2.1. A Linear Mixed-effects Model Example 



Consider the rat population growth data in lGelfand et al. (1990L Tables 3, 4). Sixty young 
rats were assigned to a control group and a treatment group with n = 30 rats in each. The 
weight of each rat was measured at ages x = 8, 15, 22, 29 and 36 days. We denote by 
yf the weights of the ith rat in group g with g = c for the co ntrol group and g = t for 



the treatmen t group. T he following linear mixed-effects model ([Laird and Ware. 19821 ) is 
considered in lLiu ( 19981 ): 



y?\6~N(Xp g +Xbf, o 2 q h), V>~N(Q,V) 



(2) 



for i = 1, • • • ,n and g = c and t, where X is the 5x2 design matrix with a vector 
of ones as its first column and the vector of the five age-points as its second column, 
Pg = (/3 g ,i, /3 g ,2)' contains the fixed effects, b\ = (b^.b^)' contains the random effects, 
$ > is the 2x2 covariance matrix of the random effects, and 9 is the vector of the param- 
eters, that is, 9= (/3 c ,i I /3c,2,/3t,i,A,2,*ia,*i,2,*2, 2 ,CT c 2 ,fT f 2 )'. Let p= (^,1,^,2,^,1,^,2)' 



and 



= Wi 



[3 = (0,0,0,0)', a 2 
SectionHU 



The starting point for running EM and ECME is chosen to be 
(1,1)', and ^ = J2. The stopping criterion used here is given in 



4 Chuanhai Liu 



For this example, ECME converges dramatically faster than EM, as shown in Figures 
[T] and [5] and Tables [S] and [5] Specifically, EM takes 5, 968 iterations and 518 .9 seconds to 
converge. With the same setting, ECME (version 1 in lLiu and Rubin (19941 ) with 9gt> Q = 
(^n, ^12, ^22, c 2 )' and 6&> L = (3) uses only 20 iterations and 1.8 seconds. The gain of 
ECME over EM is explained clearly by the relation between the slow converging directions 
of EM and the partition of the parameter space for ECME. From Table [TJ the two largest 
eigenvalues of DM EM arc 0.9860 and 0.9746, which are close to 1 and make EM converge 
very slow. From Table [21 it is clear that the first four "worst" directions of EM fall entirely 
in the subspace determined by the fixed effect f3. Since 9,y L = /? for ECME, the slow 
convergence of EM induced by the four slowest directions is diminished by implementing 
the ML-step along the subspace of (3. This is clear from the row ECME in Table [TJ where 
we see the four largest eigenvalues of DM EM become in DM ECME while the five small 
eigenvalues of DM EM remain the same for DM ECME . 



2.2. A Factor Analysis Model Example 

Con sider the confirmatory factor analysis model exa mple in Joreskog (19691) , Rubin and Thaver ('. 



and Liu and Rubin (19981) . The data is provided in lLiu and Rubin (19981) and the model is 



as follows. Let Y be the observable nine-dimensional variable on an unobservable variable 
Z consisting of four factors. For n independent observations of Y, we have 

YiKZ^p, o 2 ) ~ N(Zi/3, d\ag(al - ■ ■ ,of)) (3) 

where /J is the 4x9 factor-loading matrix, a 2 = (af, ■ • ■ , er|)' is called the vector of unique- 
nesses, and given (/3, er 2 ), Zi,--- ,Z n are independently and identically distributed with 
Zi r-j N(0,l4),i = 1, - ■ ■ ,n. In the model, there are zero factor loadings on both factor 
4 for variables 1-4 and on factor 3 for variables 5-9. Let f3j., j — 1, •■• ,4 be the four 
rows of (3, then the vector of the 36 free parameters is 9 = (/?i., /3 2 -, /?3, 1-4, ,$4,5-9, cr 2 )'. 



Liu and Rubin (1998 ) provided detailed comparison between EM and ECME. Figure 1 of 
Liu and Rubin (19981 ) shows that the gain of ECME over EM is impressive, but not as 
significant as ECME for the previous linear mixed-effects model example in Section 12.11 

The slow convergence of EM for this example is easy to explain from Table [3] which shows 
that DM EM has multiple eigenvalues close to 1. From Table 0] the eigenvector correspond- 
ing to the dominant eigenvalue of DM EM falls entirely in the subsp ace spanned by (3i and 



/?2 ■ This clearly adds difficulty to the ECME version suggested by iLiu and Rubin (19981) 



where 9,y Q = (/3i., f3 2 . , ,$3,1-4, ,$4,5-9)' and 9,y L = a 2 . For this version of ECME, the eigen- 
values of DM ECME arc given in row ECME-1 of Tabled where we see that the dominant 
eigenvalue of DM EM remains unchanged for DM ECME . To eliminate the effect of the 
slowest direction of EM, we can try another version of ECME by letting 9y Q = a 2 and 
6j> L = /?2-, /?3, 1-4, /?4, 5-9)'- The eigenvalues of DM ECME for this version are given 
in row ECME-2 of Table [3l Although the second version of ECME is more efficient than 
the first version, it is difficult in general to eliminate all the large eigenvalues in DM EM 
by accelerating EM in a fixed subspace. For example, the eigenvector corresponding to the 
second largest eigenvalue of DM EM shown in Table H] is not in the subspace spanned by 
any subset of the parameters. 
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3. 1 . The Generic DECME Algorithm 

As shown in last section, the efficiency gain of ECME over its parent EM based on static 
choices of the acceleration subspaccs may be limited since the slowest converging directions 
of EM depend on both the data and model. It is thus expected to have a great potential 
to construct the acceleration subspaccs dynamically based on, for example, the information 
from past iterations. This idea is formulated as the following generic DECME algorithm. 
At the t th iteration of DECME, the algorithm proceeds as follows. 

The Generic DECME Algorithm: the t th iteration 
Input: 6t-i 

E-step: Same as the E-step of the original EM algorithm; 
M-step: Run the following two steps: 

CM-step: Compute 9 t = argmax e (2(6>|6> f _i) as in the original EM algorithm; 

Dynamic CM-step: Compute 9 t = argmaxg e y t L(9\Y i, s ), where ^ is a low- 
dimensional subspace with 9 t G 



As noted in lMeng and van Dvk (19971 ). the ML-steps in ECME should be carried out 
after the MQ-steps to ensure convergence. Under this condition, ECME with only a single 
ML-step is obviously a special case of DECME. In case multiple ML-steps are performed 
in ECME, a slightly relaxed version of the Dynamic CM-step, i.e., simply computing 9 t 
such that L{9 t \Y obs ) > L(9 t \Y obs ), will still make DECME a generalisation of ECME. In 
cither case, the monotone increase of the likelihood function in DE CME is guaranteed 



by that of the nested EM algorithm (jDempster et al.. 1977c IWu. 19831 ). which ensures the 



stability of DECME. The convergence rate of DECME relies on the structure of the specific 
implementation, i.e., how "V t is constructed. Furthermore, the well-known method of SOR 
can be viewed as a special case of DECME. As shown in Section I3~121 SOR suffers from what 
is known as the zigzagging problem and is, thereby, often inefficient. Section 13.31 proposes 
three efficient alternatives. 



3.2. The SOR method: an Inefficient Special Case of DECME 

Let {9t — Ot-i} represent the linear subspace spanned by 9 t — 9t-i- SOR can be obtained 
by specifying % = 9 t + {9 t - 9 t -i} in the Dynamic CM-step of DECME, i.e., 9 t = 9 t +a t d t , 
dt = 9 t — 9t-i, and a t = argmax a L(9 t + adt\Y b s ). The so-called relaxation factor a t can 
be obtained by a line search. See Figure |4] for an illustration of the SOR iteration. 

The reason that SOR may be used to accelerate EM is clear from the following theorem 
which implies that, in a small neighbourhood of the MLE, a point with larger likelihood 
value can always be found by enlarging the step size of EM: 

Theorem 3.1. In a small neighbourhood of 9, the relaxation factor at of SOR is always 
positive. 
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The proof is given in Appendix [B] and the conservative movement of EM is illustrated in 
Figure [3] for a two-dimensional simulated exampl e. For simplic ity, it has also been proposed 



to choose at as a fixed positive number (e.g., lLange. 19951 ) . We call this version with 
fixed at the SORF method. Let Ai and A p be the largest and smallest eigenvalues of 
I corncobs (see Appendix IB1 for detailed discussion). It is well known that SORF achieves its 
optimal convergence rate (Ai — A p )/(Ai + A p ) if at = 2/(Ai + X p ) — 1 for any t (see, e.g., 



Salakhutdinov and Roweis. 20031) . In the past, the theoretical argument for SOR has been 
mainly based on this fact, which is obviously insufficient. The following theorem provides 
new angles for understanding the convergence of SOR. 

THEOREM 3.2. For a two-dimensional problem (i.e., p = 2) and in a small neighbour- 
hood of 9, the following results hold for SOR: 

1. ) a t = QLt-%; 

2. ) SOR converges at least as fast as the optimal SORF, and the optimal SORF con- 
verges faster than EM; and 

3. ) SOR oscillates around the slowest converging direction of EM; The SOR estimates 
from the odd-numbered iterations lie on the same line and so do those from the even- 
numbered iterations; Furthermore, the two lines intersect at the MLE 6. 

The proof is provided in Appendix [Cj The zigzagging phenomena of SOR revealed by 
conclusion 3 is illustrated in Figure [3j For the case of p > 2, it is interesting to see that 
the relaxation factors at generated from SOR also have a similar oscillating pattern as that 
for p = 2 (conclusion 1). This is illustrated in Figure [5] The top panel of Figure [3] shows 
the relaxation factors for the two-dimensional example used to generate Figure [3] and the 
lower panel shows those for a nine-dimensional simulated example. The nine-dimensional 
example is generated by simulating the behaviour of EM in a small neighbourhood of the 
MLE for the linear mixed-effects model example in Section 12.11 and Section [ 



3.3. DECME.vl and its Variants: Three Efficient DECME Implementations 

3.3.1. The Basic Version: DECME.vl 

The zigzagging problem has long been considered to be one of the major disadvantages 
for optimisation algorithms since the effective movement towards the MLE is usually small 
even if the step size is large. Figure [3] suggests a line search along the line connecting 
the zigzag points, as shown by one of the red dashed lines. For two-dimensional quadratic 
functions, this suggested procedure shown in Figure [3] converges immediately. Although this 
only represents a very rare case in practise, it motivated us to consider efficient DECME 
implementations. 

One way to proceed is to repeat the procedure shown by the red dashed lines in Figure[3j 
i.e., each cycle of the new algorithm includes two iterations of SOR and a line search along 
the line connecting the initial point of the current cycle and the end point of the second 
SOR iteration. Numerical experiments show that this procedure is not very effective. 

Another way to proceed is what we call DECME.vl. DECME_vl retains the procedure 
shown by the red dashed lines in Figure [3] as its first two iterations and is formally defined 
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as follows. At the first iteration of DECME_vl, 9\ is obtained by running one iteration of 
SOR from the starting point 9 a . At the t th iteration of DECME.vl, one iteration of SOR is 
first conducted to obtain 9f OR , followed by a line search along the line connecting 9t-i and 
gSOR ^ Q 0D t a i n Q t _ ^he process is continued for p iterations and restarted with a standard 
SOR iteration. The reason for restarting becomes clear from Theorem 13.31 below, which 
shows that DECME_vl is equivalent to a conjugate direction method. 

Formally, DECME_vl is described in the framework of the generic DECME algorithm 
by implementing the Dynamic CM-step with two line searches as follows (except for the 
iterations where the algorithm is restarted): 

Dynamic CM-step of DECME.vl: the t th iteration 

Substep 1: Calculate 9f OR = 6 t + d^\ where d\ 1] = t - § t -i, and ctp = 
argmax Q L(6»i + adf'\Y obs ); 

Substep 2: Calculate 6 t = 9? OR + a\ 2) d\ 2 \ where df ] = 9^ OR - 9 t _ 2 , and 
= argmax Q i(0f o/? + ad[ 2) \Y obs ) . 

An illustration of the DECME_vl iteration is given in Figure [4j We note that Qt is actually 
the point that maximises L(9\Y b s ) over the two-dimensional subspacc % = 6t—i + {Ot—i ~ 
9t-2, Qt — dt-i} under certain conditions. This can be seen from the proof, given in Appendix 
iDl of the following theorem, which demonstrates the efficiency of DECME_vl. 



Theorem 3.3. In a small neighbourhood of the MLE, DECME_vl with exact line search 
is equivalent to the conjugate direction method AEM. 



Theorem [3T3] implies that DECME.vl is about as efficient as AEM near the MLE in 
terms of the number of iterations. As noted in Section 1, DECME_vl is much easier to 
implement and can be made automatic for almost all EM algorithms, which typically have 
coded likelihood evaluation routines for debugging code and monitoring convergence. A 
line search method is needed for DECME.vl, but can be implemented once for all; whereas 
evaluation of the gradient vector of L(6>|y o b s ) required for AEM is problem specific and thus 
demands substantially more programming efforts. Note also that gradient evaluation can 
be expensive. For example, for comparing different methods in the optimisation literature, 
it is often to count one evaluation of the gradient vector as p function evaluations, where p 
stands for the dimensionality of the parameter space. 

The idea behind DECME.vl is very simi lar to the parallel tangent (PARTAN) method 
for accelerating the steepest descent method (jShah et al.. 19641 ) . PARTAN can be viewed a s 
a particular implementation of the c onjugate gradient method (IFletcher and Reeves. 19641 ) . 
developed based on the method in iHestenes and Stiefel (19521) for solving linear systems. 
It is also worth noting that PARTAN ha s certain advantage over the conjugate gradient 
method as discussed in iLuenberger (20031 p. 257). For example, the convergence of PAR- 
TAN is more reliable than the conjugate gradient method when inexact line search is used 
as is often the case in practise. 
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3.3.2. DECME.v2 and DECME_v3 

DECME.vl requires two line search steps in one iteration. DECME_v2 and DECME_v3, the 
two variants of DECME_vl that require a single line search in each iteration, are obtained 
by specifying a one-dimensional acceleration subspace in the dynamic CM-step as % = 
9t + {Ot - Ot-2} and % = 9 t + {#t-i - Ot-2} , respectively. This is depicted in Figure |U 

There is no much difference among the three newly proposed methods and SOR in 
terms of programming since their main building blocks, the EM iteration and a line search 
scheme, are the same. However, SOR can hardly compete with the new methods for all the 
examples we have observed. Among the three new implementations, DECME_vl usually 
uses the smallest number of iterations to converge while it obviously takes more time to 
run one DECME_vf iteration. Hence when the cost of running a line search, determined 
mainly by the cost of computing the log-likelihood, is low relative to the cost of running 
one EM iteration, DECME_v2 and DECME_v3 may be more efficient than DECME.vl in 
terms of CPU time. These points are shown by the examples in next section. 

4. Numerical Examples 

In this section we use four sets of numerical examples to compare the convergence speed of 
EM, SOR, DECME.vl, DECME_v2, and DECME_v3 in terms of both number of iterations 
and CPU time. 

4. 1. The Setting for the Numerical Experiments 

The line search scheme for all the examples is implemented by making use of the opti- 
mize function in R. The detailed discussion about the configuration of the function and 
how to achieve line search for constrained problems with line search routines designed for 
unconstrained problems is given in Appendix [El 

For the three examples in Section l4~2l (Section I2.1J) . l4~3l f Section I2.2j) . and 14.41 we hrst 
run EM with very stringent stopping critcrions to obtain the maximum log-likelihood l m ax 
for each example. Then we run each of the five (nine for the example in Section r4.2l) different 
algorithms from the same starting point and terminate them when L(9\Y {, S ) is not less than 
lmax — 1CU 6 . The results for these three examples, including number of iterations and CPU 
time, are summarised in Table [5] and [6] The increases in L(9\Y oos ) against the number of 
iteration for each of the three examples are shown in Figures [TJ [5J El and [7] The setting and 
the results for the simulation study in Section l4~5l are slightly different. 

4.2. The Linear Mixed-effects Model Example 

The mixed-effects model example of Section 12.11 is used to illustrate the performance of 
DECME when applied to accelerate both EM and ECME. The increases of L(9\Y b s ) are 
shown in Figure [TJ and [5] For this example, while EM needs 5, 968 iterations and SOR needs 
918 iterations to converge, all three new implementations of DECME needs no more than 
170 iterations with only 104 iterations for DECME_vl. In terms of CPU time all three new 
methods converge about 30 times faster than EM. It is also interesting to see that the new 
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methods works very well for accelerating ECME (especially DECME_vl further reduces the 
number of iterations from 20 to 9) even when ECME already converges much faster than 
EM. 



4.3. The Factor Analysis Model Example 

The factor analysis example of Scction l2~2"l and the same starting point used in lLiu and Rubin (199 
are used here. The increases of L(d\Y t s ) are shown in Figure El For this example, 
while all three new implementations of DECME converges much faster than EM and SOR, 
DECME_vl uses only less than 1% of the number of iterations of EM (55 to 6,672) and is 
about 30 times faster than EM in terms of CPU time (1.2 seconds vs. 33.4 seconds). It is 
also interesting to see that all three new methods pass th e flat period shown in Figure [5] 
much more quickly than both EM and SOR. As discussed in lLiu and Rubin (19981 ). the long 
flat period of EM and SOR before convergence makes it difficult to assess the convergence. 
Clearly, this does not appear to be a problem for the three new methods. 



4.4. A Bivariate t Example 

Let tp(/j,,^,iy) represent a multivariate t distribution with [i, 'J', and v as the mean, the 
covariance matrix, and the degree of freedom, respectively. Finding the MLE of the param- 
eters (/x, 'J, v) is a well known interesting applicatio n of the EM-type algor ithms. 

Here we use the bivariat e t distribution exam ple in lLiu and Rubin (19941) , where the data 
is adapted from Table 1 of ICohen et al. (19931 ). Figure [7] shows the increases of L(9\Y b s ) 
for each algorithm starting from the same point (fx, 'J, v) = ((0, 0)', diag(l, 1), 1). For this 
example, EM converges relatively fast with 293 iterations and 1.4 seconds. But we still 
see the advantage of the new implementations of DECME over EM and SOR: DECME_vl 
uses only 32 iterations and 0.9 second to converge. Also note that SOR uses 1.5 seconds to 
converge, which is slightly more than that of EM. 



4.5. Gaussian Mixture Examples 

The EM algorithm is wildly acknowledged as a powerful method for fitting the mixture mod 
els, which are popular in many different areas such as machine learning and pattern recogni 



tion (e.g., Jordan and Jacobs. 1994 : McLachlan and Krishnan. 1997 : McLachlan and Peel. 200C)[ 
Bishop. 2006T ~ While the slow convergence of EM has been frequently reported for fitting 

mixture mo dels, a few extensions have been proposed for spec i fically accelerating this EM 

application (jLiu and Sun. 1997tlDasgupta and Schulman. 2000l : ICeleux et al.. 2001[|Pilla and Lin 



among others). Here we show that DECME, as an off-the-shelf accelerator, can be easily 
applied to achieve dramatically faster convergence than EM. 

A class of mixtures of two univariate normal densities is used to illustrate the rela- 
tion betwe en the efficiency of EM and the separation of the component populations in the 
mixture in lRedner and Walker (19841 ). Specifically, the mixture has the form of 



p(x\lTl, 7^,^1,^2,01 j 0- !) = 
Pi(x\(H,o?) = 



-1 e -(=-M*) 2 /2a? ! f=1 2 . 



(4) 
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Let 7Ti = 0.3, 7T2 = 0.7, erf = o\ = 1, and /ii = — /i2, then ten random samples of 1,000 
observations were generated from each case of fii — \xi = 6, 4, 3, 2 and 1.5. We ran EM, 
SOR, DECME.vl, DECME_v2, and DECME_v3 from the same starting point t^ 0) = tt^ ] = 
0.5, of ' = 0.5, /i,- ' = 1.5/i;. The algorithms are terminated when ||#t+i — t \\i < 10 -5 , 
where || • ||i represents the l\ norm. The results for number of iterations and CPU time are 
shown in Figure [5] and Figure [§J respectively. 

We see that SOR typically uses about half of the number of iterations of EM while 
the two have very similar performance in terms of CPU time. When EM converges very 
slowly, the three new implementations of DECME can be dramatically faster than EM 
with a factor 100 or more in terms of number of iterations and a factor of 50 or more in 
terms of CPU time. When EM converges very fast, from Figure^ we see some overlapping 
among several methods in terms of CPU time for the first group often simulations, although 
all the accelerators still outperform EM in terms of number of iterations. In practise, fast 
convergence like this is somewhat rare for EM. Notice that all the methods take less than one 
second to converge. Hence, this phenomenon of overlapping should not be used to dismiss 
the advantage of the accelerators. Nevertheless, this may serve as empirical evidence to 
support the idea of accelerating EM only afte r a few EM iterations have been conducted as 
suggested in I Jamshidian and Jennrich (19931 ). 



5. Discussion 

The limitation of using fixed acceleration subspaccs in ECME led to the idea of dynamically 
constructing the subspaces for the supplementary ML-steps. We formulated this idea as 
the generic DECME algorithm, which provides a simple framework for developing stable 
and efficient acceleration methods of EM. The zigzagging problem of SOR, a special case 
of DECME, motivated the development of the three new DECME implementations, i.e., 
DECME_vl-v3. The stability of DECME is guaranteed by the nested EM iteration. The 
equivalence of DECME.vl to AEM, a conjugate direction method, provides theoretical 
justification for the fast convergence of the new methods, which is also supported by the 
numerical results. Moreover, the simplicity of the new methods makes them more attractive 
than AEM. 

In optimisation literature, it is popular to analyse the convergence of optimisation algo- 
rithms near the optimum with the assumption of an ideal exact line search. However, exact 
line search is not a realistic choice in practise (see the discussion in Appendix IE)) . Hence, the 
relative performance of DECME.vl and AEM, in both efficiency and stability, could be very 
different, especially when the starting point is far from th e MLE. Note also that many works 



hav e been done to exp edite the line search for SOR, e.g.. ISalakhutdinov and Roweis (20031 ) 



and iHesterberg (20051 ). It will be interesting to see how the similar techniques perform for 
DECME for we have shown that the newly proposed acceleration directions work much 
better than that used by SOR. 

Our main focus in the current paper has been on accelerating EM. However, it is note- 
worthy that the proof of Theorem 13.31 only depends on the linear convergence rate of the 
underlying algorithm being accelerated rather than its specific structure. Hence an immedi- 
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ate point to make is that the new methods should also work for ot her EM-type algorithms o f 
linear convergence rate or more broadly for the MM algorithm (IHunter and Lange. 2004 ). 
We leave this problem open for future investigation. 
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Appendix 

A. The Linear Convergence Rate of EM: a Quick Review 

This section reviews some well known convergen ce properties of EM to esta blish necessary 



notations. These results are mainly adapted from lDempster et al. (19771 ) and lMeng and Rubin (1 

In a small neighbourhood of the MLE, the observed log-likelihood L(9\Y b s ) may be 
assumed to be a quadratic function: 

L(6\Y obs ) = ~\{0- 9)'I obs (6 - §). (5) 



Under this assumption, iDempster et al. (19771) proved that EM has a linear convergence 



rate determined by DM EM , i.e., equation ([T]). We mentioned previously that DM EM is 
called the missing information fraction. It is named after the following identity: 

DM EM = I p - IJo m Iobs = Icmnlmis, (6) 

where I p represents the identity matrix of order p, I Q bs and I CO m are the negative Hessian 
matrices of L(9\Y b s ) and Q(9\Y b s ,9) at the MLE, and J TO j S = I com — I b s - The matri- 
ces I bs, Imis and I C om are usually called observed-data, missing-data, and complete-data 
information matrices. We assume that these matrices are positive definite in this paper. 

1/2 

Since I com is positive definite, there exists a positive definite matrix, denoted by Icom, 
such that Icom = Iclmlcom- Further denote by Icom the inverse of iUm- Then Idiots is 

similar to Icom^l com^-obs^ Icom — Icom lobslcom and, thereby, I com I bs and Icom lobslcom 

1/2 1/2. 

have the same eigenvalues. Since I C om lobslcom is symmetric, there exists an orthogonal 
matrix T such that 

Icom lobslcom = TAT , (7) 

where A = diag(Ai, • • • , A p ), and A,;, i = 1, • • • ,p, are the eigenvalues of I^^Iobs- Therefore, 

t— 1 j T^^/^T^XT^ ( Q\ 

1 com 1 obs — L com 1 ilJ 1 com- \°) 

Let P = Icom 2 T, then we have I C0 \ n I b s = PAP 1 . Furthermore, the columns of P and the 
rows of P _1 are eigenvectors of Ic~ \rJobs and lobslcom^ respectively. Define r\ = — 9), 

then from equation ([!]) we have rjt = (Ip — A)rjt—i, or equivalently 



Vt,i = (1 - Ai)i7t_i,j, i = 1, • • • ,p. 



(9) 
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Equation © implies that EM converges independently along the p eigenvector directions 
of Icom^obs (or equivalently DM EM ) with the rates determined by the corresponding eigen- 
values. For simplicity of the later discussion, we assume 1 > Ai > A2 > • • • > A p > and 
Vo,i ^ 0, i = 1, •• • ,p. 

B. The Conservative Step Size of EM: Proof of Theorem |3?TI 

From equation (0 and the definition of SOR in Section 15721 it is easy to show that 

a t = ( t- t-i)'lo b s(0-0t) (1Q) 
[fit — 6t-l)'Iobs{6t — Ot-i) 

Then making use of the fact that 9 t = 8t-i + Icom^obs(9 — 9t-i) (followed from equations [T] 
and[S]) leads to 

(9 - 6 t -l)' Iobslcomtobsifi - Qt-i) , 

a t — — = — — - 1. (11) 

[9 — 9t-l)' ' Iobslcomlobslcomlobsifi — 6t-l) 

By definition of 77, we have 9 — 9~t-\ = IcomTrjt—i- Making use of equation and the fact 
that T is an orthogonal matrix yields 

a t = — — tjz 1. (12) 

Since A is diagonal and all its diagonal elements are between and 1, it follows immediately 
that a t > 0. □ 



C. The Convergence of SOR: Proof of Theorem [ 

Similar to equation (TjQ) and © for EM, we have the following results for SOR: 

9 - 9t = [I P - (1 + at)I- o l m I obs ]0 - 9 t -i), (13) 

and 

fjt,i = [1 - (1 + a t )\i\fjt-i,i , i = !,■■•, p. (14) 
For p = 2, from equation (|12l) . we have 

a * = T3^2 -T3^2 1» ( 15 ) 

A lVt-l,l + A 2%-1,2 

and then, 

1 ri_L u A|(A 2 - Ai)r) t 2 _^ 2 A 2 (Ai - A 2 )?? t 2 _ 1 x 

1 - (1 + Ot)Ai = -T3^2 > 1 - (! + a *)A 2 = Tg^ — rg^g • (16) 

From equation (| 14[) and equation (|16[) . we have 
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It follows that fjt^i/fit^ = Vt—2,i/Vt-2,2- Furthermore, from equation (|15p. we have 



at ~ A3(*_ 1 .i/i*-i,2) 2 + A3 " (18) 

and immediately at = «t_ 2 , which proves conclusion 1. 

Now define a trivial algorithm, called SOR2, where each iteration of SOR2 includes two 
iterations of SOR. From equation (jT3j) , we have 



- Ot+i = [h - (1 + a t )I-^ m I obs }[l2 - (1 + at-JI-llobsKO - Ot-i). (19) 

By conclusion 1, [I p - (1 + a t )I~^ m I obs \[I p - (1 + a t -i)-^n-fo6«] is a constant matrix and 
denote it by DM SOR2 , which obviously determines the convergence rate of SOR2. By using 
equation ©, we have DM som = / c " ^ 2 r[/ p -(l + a. f )A][/ p -(l + a t _i)A]r7 c 1 / , 2 ! .. Moreover, 
with equation (fTTf and (fT5)l . it is easy to show that 

[l-(l + a t )X j ][l-(l + a t . 1 )X j } = (A2 /A ~ 2 i l)2 x2e r , .7 = 1,2. (20) 

V A 2 r 't-l,2 A l r 't-l,l / 

R follows that DM som = [l-(l+a t )Ai][l-(l + Q! t _i)Ai]J2, which means SOR2 converges 
with the same rate [1 — (1 + at)Ai][l — (1 + at-i)X\\ along any direction. From equation 
(|2"U| , it is easy to see that 

[!-(! + a 4 )Ax][l - (1 + a^OAx] < ^ " ^ = (1 - --^V) 2 < (1 - A 2 ) 2 . 

(Ai + XiY Ai + A 2 

Note that (Ai — A 2 )/(Ai + A 2 ) is the optimal convergence rate of SORF and that 1 — A 2 is 
the convergence rate of EM. Hence conclusion 2 follows. 

Since Ai > A 2 , equation (|TB)) implies that 1 — (1 + a t )Ai < and 1 — (1 + a t )A 2 > 0. So 
from equation (|14[) . we have f?t,if?t-i,l < an d fjt. 2Vt-i,2 > 0. This proves the first statement 
in conclusion 3. Note that 6 t+ 'i-9 t = {I-DM 80 * 2 ^-^) az6-6 t -i. Hence 6 t+1 -6 t -i 
is parallel to 6 — 9 t -i, which concludes the second statement in conclusion 3. □ 

D. The Convergence of DECME_v1 : Proof of Theorem I3T31 

We prove t his by induction. This version of proof is similar to the proof of the PARTAN 
theorem in iLuenbereer (20031 pp. 255-256). However, the difference between a generalised 
gradient direction and the gradient direction should be taken into account. 

It is certainly true for t = 1 since the first iteration is a line search along the EM direction 
for both DECME.vl and AEM. 

Now suppose that 0q, 9i, ■ ■ ■ , 9 t -\ have been generated by AEM and 9 t is determined 
by DECME.vl. We want to show that 9 t is the same point as that generated by another 
iteration of AEM. For this to be true 9t must be the point that maximises L(9\Y b s ) over 
the two-dimensional plane 9 t -\ + {9t-x — 9t-2,@t — #t-i}- Since we assume that L{9\Y b s ) is 
a quadratic function with a positive definite Hessian matrix, L(9\Y i, s ) is strictly convex and 
we only need to prove g t (gradient of L(9\Y b s ) a,t9 t ) is orthogonal to 9 t -\— 9t-2 and 9t—9t-i, 
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or equivalently 9f OR — 9 t -2 and 9 t — 9t-x- Since 9t maximises L(9\Y t, s ) along 9f OR — 9t-2, 
g t is orthogonal to Of OR ' — 6t-2- Similarly, gf is orthogonal to Ot — Ot-x- Furthermore, we 
have g' t _ 2 (0 t - t _i) =(§- Ot-2)' ^/"^(fl - § t -i) = (0 t -i - t ^)'~g t ., = 0, w here the 
last identity is true due to the Expanding Subspace Theorem ( Luenberger. 20031 p. 241) 
for the conjugate direction methods. Then g' t {0f OR - t -i) = 0- 9 t )'I obs (9f OR - 9 t -i) = 
[0-0 t ^ 2 -(l + a[ 2) )(0f OR -0^ 2 )}'I o U0f OR -^ 

0f OR )'I obs ]'{0f OR -0 t - X ) = [-af ) g t - 2 + (l + a? ) )gf OR ]'{0f OR -0 t - 1 ) = 0. It follows that 
g t is orthogonal to 0f OR - t -i- n 



E. Implementation of Line Search 

In practise, it is neither computationally feasible nor necessary to conduct exact line search. 
In fact we can often achieve higher efficiency by sacrificing accuracy in the line search rou- 
tine, although the number of iterations may increase. There are various criterions for 



terminating the line search routine for a desirable trade-off ([Luenberger. 20031 pp. 211-214) 



and different approaches h ave been proposed for efficient implementation of those criterions 



(jMore and Thuente. 19941 ). Furthermore, some transformations to transform constrained 



problems into unconstrained problems can be useful, as discussed in lSalakhutdinov and Roweis (2( 

Here we take a different approach to implement the line search by taking the advantage 
of the fact that sometimes it is easy to figure out the feasible region of a constrained problem 
along a single line. After the feasible region is computed, many commonly used line search 
routines available in standard software can be easily applied. In our case the line search is 
conducted with the optimize function in R by passing the computed interval to the optimize 
function through its option interval =. Note that the interval computed in this way is 
usually very wide and some other information may be used to narrow it down for higher 
efficiency. For example, we can start the line search by forcing a > for SOR. Furthermore, 
we control the accuracy of the line search by setting tol = 0.01 in the optimize function. 
This choice is somehow arbitrary. One advantage is that the line search is forced to be more 
accurate when the algorithm approaches the MLE since the magnitude of the differences 
between consecutive estimates usually becomes smaller with the progress of the algorithm. 

For the constraints involved in the examples, we summarise the methods to obtain the 
feasible region as follows. Denote the current estimation by and the search direction by 
d. Our goal is to find the feasible region of a (an interval including for the examples used 
in this paper) for a univariate function f(0 + ad). If there are several sets of constraints 
for one model, we can determine the feasible region induced by each of them and then 
take their intersection. Without loss of generality, we assume in the following that d is the 
counterpart of the discussed parameters in the vector representing the search direction. 

1.) The degree of freedom v in the t distribution. It is easy to compute the boundary for 
a such that v + ad > 0. 



2.) The mixing coefficients, 7r.;, i = 1, • • • , K, in the mixture model. There are two types 
of constraints here, i.e., 2i=i 7r i = 1 an< i ^ — 0- By using the first constraint, we 
only need to consider the first K — 1 coefficients with constraints J^i^ 1 n i ^ 1 an d 
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TTj > 0, i = 1, • • • , K — 1. Then we only need to find the intersection of the solutions 
for the inequalities 2 i=1 ^ + a X^i* — 1 an d + > 0; i = 1, - ■ ■ ,K — 1. 

3. ) The variance components in the linear mixed-effects model and the mixture model 

and the uniquenesses in the factor analysis model. This can be handled in the same 
way as that for the degree of freedom in the t distribution. 

4. ) The covariance matrices in the linear mixed-effects model and the t distribution. For 

the current paper, only two-dimensional covariance matrices are involved. A two- 
dimensional matrix is positive definite if and only if &i t i > and det('I') > 0. 
Hence we only need to guarantee + ad\ t \ > and det(\l/ + aD) > (assume D is 
the matrix generated from the vector d in the same way as is generated from 6). For 
other covariance matrices of fairly small size, similar method could be used. When the 
dimension of the covariance matrix is high, it is a common practise to enforce certain 
structure on the matrix. For example, in spatial statistics, the covariance matrices 
are usually assumed to be generated from various covari ance functions with very few 
parameters ( Zhang. 20021 : Zhu et al.. 2005 ; Zhang. 2007 ) and the feasible region of a 
can be easily obtained. 
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Table 1. Eigenvalues of DM EM and DM ECME for the Linear Mixed-effects 
Model Example in Section [2TTI 

Algorithm Eigenvalues of the missing information fraction 

"EM 0.9860 0.9746 0.7888 0.6706 0.5176 0.3874 0.3260 0.2710 0.0364 

ECME 0.5176 0.3874 0.3260 0.2710 0.0364 0.0000 0.0000 0.0000 0.0000 



Table 2. The Four Largest Eigenvalues and the Corresponding Eigenvectors of 
DM EM for the Linear Mixed-effects Model Example in Section I27T1 

Eigenvalue Corresponding eigenvector 

0.9860 (0.0000 0.0000 -0.0413 -0.9991 0.0000 0.0000 0.0000 0.0000 0.0000) 
0.9746 (0.0413 0.9991 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000) 

0.7888 (0.0000 0.0000 -0.0433 0.9991 0.0000 0.0000 0.0000 0.0000 0.0000) 

0.6706 (-0.0433 0.9991 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000) 

Table 3. The Ten Leading Eigenvalues of DM EM and DM ECME for the Factor Analysis 
Model Example in Section l2T2l 

Algorithm Ten leading eigenvalues of the missing information fraction 

EM 1-2E-12 0.9992 0.9651 0.9492 0.9318 0.8972 0.8699 0.8232 0.8197 0.7876 

ECME-1 1-2E-12 0.9979 0.9509 0.9292 0.9124 0.8725 0.8480 0.8031 0.7877 0.7539 
ECME-2 0.9987 0.8715 0.7321 0.6673 0.5184 0.4770 0.4496 0.3727 0.3369 0.0000 



Table 4. The Two Largest Eigenvalues and the Corresponding Eigenvectors of 
DM EM for the Factor Analysis Model Example in SectionET2l 

Eigenvalue Corresponding eigenvector 

1-2E-12 0.0812 0.0934 -0.4897 -0.1335 0.0684 0.0748 0.0363 -0.0864 -0.0949 

0.0996 0.1288 0.7171 0.1962 0.1085 0.1138 0.0954 0.2099 0.2047 
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 
0.9992 0.0046 0.0057 -0.0047 -0.0005 -0.0047 -0.0034 -0.0038 -0.0013 0.0005 

-0.0062 -0.0071 -0.0092 -0.0064 0.0053 0.0060 0.0045 0.0049 0.0034 
0.0267 0.0441 -0.2871 0.0013 -0.0103 -0.0177 -0.0106 -0.0139 -0.0144 
-0.0028 0.0079 -0.9557 0.0111 0.0013 0.0040 -0.0011 -0.0028 0.0006 



Table 5. Comparison of Convergence for the Three Examples in Section [4~2l 14.31 and !4.4l 



Number of iterations 



Example 


EM 


ECME 


SOR 


DECME.vl 


DECME_v2 


DECME_v3 


Mixed effect: EM 


5,968 


/ 


918 


104 


133 


166 


Mixed effect: ECME 


/ 


20 


15 


9 


13 


15 


Factor analysis 


6,672 


/ 


1,698 


55 


91 


150 


Bivariate t 


293 


/ 


96 


32 


48 


63 


Table 6. Comparison of Convergence for the Three Examples in Section[4~2l 14.31 and!4.4l cont'd 










CPU time (s) 






Example 


EM 


ECME 


SOR 


DECME.vl 


DECME_v2 


DECME_v3 


Mixed effect: EM 


518.9 


/ 


110.9 


15.7 


15.6 


19.3 


Mixed effect: ECME 


/ 


1.8 


1.8 


1.3 


1.6 


1.8 


Factor analysis 


33.4 


/ 


21.3 


1.2 


1.3 


2.0 


Bivariate t 


1.4 


/ 


1.5 


0.9 


0.9 


1.1 
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Comparison for the Mixed Effect Model Example: 
EM and Its Accelerators 




"i i i i i i i i i i 

5 10 20 50 100 200 500 1000 2000 5000 



Iteration 

Fig. 1. Comparison for the Linear Mixed-effects Model Example in Section [27TI and Section H~2l 
Displayed are increases in L(0\Y obs ). 



Comparison for the Mixed Effect Model Example: 
ECME and Its Accelerators 
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Fig. 2. Comparison for the Linear Mixed-effects Model Example in Section IzTI and Section l4~2l 
Displayed are increases in L(0\Y obB ), cont'd 



The DECME Algorithm 21 



Comparison of the Paths 
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Fig. 3. Comparison of the Paths of EM, SOR, and DECME_v1 for a Two-dimensional Example. 
The eigenvalues of DM EM are 0.9684 and 0.6232; the darkviolet cross on the upright corner shows 
the directions of the two eigenvectors of DM BM ; The red dashed lines represent the true path of 
DECME_v1 in its second iteration. 




Fig. 4. Illustration for One Iteration of the DECME Implementations. 
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Relaxation Factor: p=2 
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Relaxation Factor for the Mixed Effect Model Example 
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Fig. 5. Relaxation Factor a t Generated from SOR. The top panel plots the sequence of a t for the 
two-dimensional example used to generate Figure [3] and the bottom panel plots the sequence of 
a t from the simulated nine-dimensional example in Section \32\ by using information from the linear 
mixed-effects model example in Section[2j]and Section l4~2l . 



Comparison for the Factor Analysis Model Example 
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Fig. 6. Comparison for the Factor Analysis Model Example in Section l2~2l and I4.3I Displayed are 
increases in L(6\Y obs ). 
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Fig. 7. Comparison for the Bivariate t Example in Section E~4l Displayed are increases in L(d\Y obs ). 



Comparison for the Gaussian Mixture 
Example: No. of Iterations 
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Fig. 8. Comparison for the Gaussian Mixture Example in Section I4.5I Displayed are n umber of 
iterations; the smoothed curves are generated by robust local regression Icieveland, 19791) . 
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Comparison for the Gaussian Mixture 
Example: CPU Time 
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Fig. 9. Comparison for the Gaussian Mixture Example in Section l4~5l Displayed are CPU time; the 
smoothed curves are generated by robust local regression. 



